Raymond J. Carroll  Peter Hall
SUMMARY. Suppose we observe the sum of two independent random variables X and Z,
where Z denotes measurement error and has a known distribution, and where the unknown
density $f$ of X is to be estimated. It is shown that if Z is normally distributed and if $f$ has
k bounded derivatives, then the fastest attainable convergence rate of any nonparametric
estimator of $f$ is only $(\log n)^{-k/2}$. Therefore deconvolution with normal errors may not
be a practical proposition. Other error distributions are also treated. Stefanski-Carroll
(1987b) estimators achieve the optimal rates. Our results have versions for multiplicative
errors, where they imply that even optimal rates are exceptionally slow.

KEY WORDS. Deconvolution; density estimation; errors-in-variables; measurement error;
rates of convergence.
1.  INTRODUCTION


Suppose we wish to gain information about the density $f$ of a random variable X, but
because of measurement error can only observe Y = X + Z, where the measurement error
Z is independent of X. Assume Z has a known density function $f_Z$ with characteristic
function $\phi_Z$. Our paper addresses the question: from a sample $Y_1, \ldots, Y_n$, how well can $f$
be estimated?

Applied problems in which knowledge of $f$ is required are discussed by Mendelsohn
& Rice (1983); see also Medgyessy (1977). Nonparametric estimates of $f$ are discussed by
Stefanski & Carroll (1987b).

An application of our results is to the nonparametric Empirical Bayes problem; see
Maritz (1970) and Berger (1980). Here $f$ represents the prior distribution for a sequence
of location parameters $X_1, \ldots, X_n$. The idea is to estimate the prior nonparametrically,
as opposed to the alternative device of specifying a parametric form for the prior with
parameters to be estimated. Our paper addresses the question: how well can a prior be
estimated nonparametrically?

Another application is to the problem of measurement error models (errors-in-variables)
for nonlinear regression and generalized linear models; see Stefanski & Carroll (1987a).
Other recent papers include Carroll et al. (1984), Stefanski & Carroll (1985), Stefanski
(1985) and Schafer (1987). In this problem, X is the true predictor but because of measurement
error Z we can observe only Y = X + Z. While the middle two references use
a sensitivity analysis approach, Carroll et al. (1984) and Schafer (1987) assume a specific
distributional form for $f$. Our paper addresses the question of how well the data can be
used in a nonparametric way to suggest a parametric form for $f$. Schafer (1987) shows that
in generalized linear models, the EM algorithm for maximum likelihood requires knowledge
of the first two conditional moments of X given Y and the response variable in the
generalized linear model. Other problems will require the conditional moments of X given
Y. In either case, how well these conditional moments can be estimated from data depends
crucially on how well $f$ can be estimated from data.

The case of normal measurement error is particularly important. We show in this
paper that if $f$ has k bounded derivatives, and if errors are normal, then the fastest rate
of convergence of any estimator of $f$ is only $(\log n)^{-k/2}$, and that this rate is achieved
by a kernel estimator of Stefanski-Carroll (1987b) type. This very slow rate suggests
that deconvolution may not be a practical procedure with normal errors, even if optimal
estimators are employed. With k = 2, it also follows that the best achievable rate for
estimating the distribution function of X can be no faster than $(\log n)^{-3/2}$. Thus, even
estimating probabilities for X is difficult.

We also show that Stefanski-Carroll estimators attain optimal convergence rates for
many other error distributions, such as gamma, exponential and double exponential. For
example, the optimal achievable rate in the double exponential case is $n^{-k/(2k+6)}$. Our
results indicate that if the error density is compactly supported and infinitely differentiable
then the optimal convergence rate is slower than $n^{-a}$ for any $a > 0$. Deconvolving a
density with smooth measurement error is intrinsically difficult, with convergence rates
much slower than those usually encountered in density estimation.
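To give a feel for how slow these rates are, the following minimal sketch computes the sample size needed to halve each rate factor. It ignores all multiplicative constants, the function names are illustrative, and the error-free benchmark $n^{-k/(2k+1)}$ is the classical kernel density estimation rate, added here as an assumption for comparison rather than a result of this paper.

```python
import math


def n_to_halve_log_rate(n, k):
    """Sample size n2 such that (log n2)^(-k/2) = 0.5 * (log n)^(-k/2)."""
    # Halving (log n)^(-k/2) requires multiplying log n by 2^(2/k).
    return n ** (2.0 ** (2.0 / k))


def n_to_halve_power_rate(n, exponent):
    """Sample size n2 such that n2^(-exponent) = 0.5 * n^(-exponent)."""
    return n * 2.0 ** (1.0 / exponent)


k = 2
n = 1000
# Normal errors: rate (log n)^(-k/2).
print("normal errors:            ", n_to_halve_log_rate(n, k))
# Double exponential errors: rate n^(-k/(2k+6)), as stated in the text.
print("double exponential errors:", n_to_halve_power_rate(n, k / (2 * k + 6)))
# Error-free benchmark (assumption): rate n^(-k/(2k+1)).
print("no measurement error:     ", n_to_halve_power_rate(n, k / (2 * k + 1)))
```

With k = 2, halving the logarithmic rate factor requires squaring the sample size, whereas an algebraic rate only requires multiplying it by a fixed factor.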

Our results have obvious implications for models with multiplicative error, Y = XZ,
which may be expressed additively by taking logs. The density of log Z is infinitely
differentiable in many important cases, such as when Z is gamma or lognormal, and so
convergence rates are extremely slow. Hence, deconvolution is difficult when errors are
multiplicative.

Of course, our lower bounds to convergence rates continue to apply when error distributions
are known imperfectly, for example when errors are normal with unknown variance.
In such cases, where the error distribution is specified up to estimable parameters,
the distribution can often be estimated $\sqrt{n}$-consistently by replication. Since estimators
of the X-density $f$ converge at rates considerably slower than $n^{-1/2}$, replacing the true
error distribution by its estimated version does not measurably affect convergence rates
of Stefanski-Carroll estimators. Hence, both our lower and upper bounds to convergence
rates apply when error distributions are imperfectly specified, up to a parametric form.
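As an illustration of estimation by replication, here is a minimal sketch under the assumption (not made explicit in the text) that each unit is measured twice, $Y_{i1} = X_i + Z_{i1}$ and $Y_{i2} = X_i + Z_{i2}$, with independent mean-zero errors of common variance. Then $Y_{i1} - Y_{i2}$ has variance twice the error variance, whatever the distribution of X. The function name and the simulated data are hypothetical.

```python
import numpy as np


def error_variance_from_replicates(y1, y2):
    """Estimate var(Z) from paired replicates y1 = x + z1, y2 = x + z2.

    The unobserved x cancels in the difference, and var(z1 - z2) = 2 var(Z).
    """
    d = np.asarray(y1, dtype=float) - np.asarray(y2, dtype=float)
    return float(np.mean(d ** 2) / 2.0)


# Simulated example: skewed X, standard normal errors (true var(Z) = 1).
rng = np.random.default_rng(1)
n = 20000
x = rng.gamma(shape=2.0, scale=1.0, size=n)
y1 = x + rng.normal(0.0, 1.0, size=n)
y2 = x + rng.normal(0.0, 1.0, size=n)
print(error_variance_from_replicates(y1, y2))
```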


The next section gives details of our calculations in the case of normal measurement
errors. In section 3 we briefly discuss other error distributions.


2.  DECONVOLUTION  WHEN  ERRORS  ARE  NORMAL 


Write $C_k(B)$ for the class of k-times differentiable densities $f$ having $\sup f \le B$ and
$\sup |f^{(k)}| \le B$. Let X have density $f$, Z be normal N(0, 1) independent of X, and
Y = X + Z. The following theorem provides bounds to the accuracy with which $f \in C_k(B)$
can be estimated from an n-sample of Y's.


Let $x_0$ be any real number, and $\hat f(x_0)$ be any nonparametric estimator of $f(x_0)$, based
on an n-sample of Y's.


Theorem 1. Assume that the error distribution is normal N(0, 1). If, for some sequence
of positive constants $\{a_n, n \ge 1\}$, we have

$$\liminf_{n \to \infty} \inf_{f \in C_k(B)} P_f\{|\hat f(x_0) - f(x_0)| \le a_n\} = 1 \qquad (2.1)$$

for each B > 0, then

$$\lim_{n \to \infty} (\log n)^{k/2} a_n = \infty . \qquad (2.2)$$


Theorem 1 declares that the rate of convergence of $\hat f$ to $f$ cannot be faster than
$(\log n)^{-k/2}$, over densities in $C_k(B)$. Kernel estimators attaining this rate of convergence
may be constructed as follows; see Stefanski and Carroll (1987b). Let G be a symmetric
function vanishing outside $(-1, 1)$, having k + 2 bounded derivatives on $(-\infty, \infty)$, and
satisfying $G(t) = 1 + O(|t|^k)$ as $t \to 0$. Put $h = (2/\log n)^{1/2}$,

$$G(w, h) = (2\pi)^{-1} \int \cos(tw/h)\, G(t) \exp\{(t/h)^2/2\}\, dt$$

and $\hat f(x) = (nh)^{-1} \sum_{j=1}^{n} G(Y_j - x, h)$, where $\{Y_1, \ldots, Y_n\}$ is a random sample from the
distribution of Y. We have the following converse to Theorem 1.


Theorem 2. Assume that the error distribution is normal N(0, 1). If the constants $a_n$
satisfy (2.2), and if $\hat f$ is the kernel estimator just defined, then (2.1) holds for each real
number $x_0$ and each B > 0.
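The kernel estimator defined before Theorem 2 can be sketched numerically as follows. This is only an illustration under stated assumptions: the window $G(t) = (1 - t^2)^5$ on $(-1, 1)$ is a hypothetical choice that is symmetric, vanishes outside $(-1, 1)$ and satisfies $G(t) = 1 + O(t^2)$ (so it corresponds to k = 2); the integral defining $G(w, h)$ is approximated on a grid; and the names `g_window`, `deconv_kernel` and `f_hat` are not from the paper.

```python
import numpy as np


def g_window(t, m=5):
    """Hypothetical choice of G: (1 - t^2)^m on (-1, 1), zero elsewhere."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) < 1.0, (1.0 - t ** 2) ** m, 0.0)


def deconv_kernel(w, h, n_grid=401):
    """Approximate G(w, h) = (2 pi)^(-1) int cos(t w / h) G(t) exp{(t/h)^2 / 2} dt."""
    w = np.atleast_1d(np.asarray(w, dtype=float))
    t = np.linspace(-1.0, 1.0, n_grid)
    dt = t[1] - t[0]
    weight = g_window(t) * np.exp((t / h) ** 2 / 2.0)
    integrand = np.cos(np.outer(w, t) / h) * weight
    # The integrand vanishes at t = -1 and t = 1, so this Riemann sum
    # coincides with the trapezoid rule on the grid.
    return integrand.sum(axis=1) * dt / (2.0 * np.pi)


def f_hat(x, y):
    """Estimate f(x) from observations y = X + Z with Z ~ N(0, 1)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    h = np.sqrt(2.0 / np.log(n))  # bandwidth h = (2 / log n)^(1/2)
    return float(deconv_kernel(y - x, h).sum() / (n * h))


# Simulated example: X ~ N(0, 2^2), so f(0) = 1 / (2 sqrt(2 pi)).
rng = np.random.default_rng(0)
n = 5000
y = rng.normal(0.0, 2.0, size=n) + rng.normal(0.0, 1.0, size=n)
print("estimate:", f_hat(0.0, y), " truth:", 1.0 / (2.0 * np.sqrt(2.0 * np.pi)))
```

At n = 5000 the bandwidth is still about 0.48, so a visible smoothing bias should be expected; this is consistent with the logarithmic rate.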


Theorem 2 is easily proved, as follows. Put $K(y) = (2\pi)^{-1} \int e^{ity} G(t)\, dt$, a real-valued
function integrating to unity. Integrating by parts k + 2 times we see that $|K(y)| \le
C(1 + |y|^{k+2})^{-1}$. By Fourier inversion, $\int y^j K(y)\, dy = 0$ for $1 \le j \le k - 1$, implying that
K is a k'th-order kernel (Prakasa Rao 1983, p.42). If Y = X + Z then $E\{G(Y - x, h) \mid
X\} = K\{(X - x)/h\}$, so that the mean of $\hat f(x)$ is the same as that of a classical k'th-order
kernel estimator based on an n-sample from the X-population. Therefore $|E\hat f(x) - f(x)| \le
C_1(B) h^k$ uniformly in $f \in C_k(B)$ (Prakasa Rao 1983, p.47). Furthermore,

$$nh^2\, \mathrm{var}\{\hat f(x)\} \le E\{G(Y - x, h)^2\} = E[E\{G(Y - x, h)^2 \mid X\}]$$

$$\le C_2(B) \iint \exp[(2h^2)^{-1}\{s^2 + t^2 - (s + t)^2\}]\, ds\, dt$$

$$\le 4 C_2(B) \int_0^1 \int_0^1 \exp(st/h^2)\, ds\, dt \le 4 C_2(B) \exp(1/h^2) ,$$

whence, noting that $h = (2/\log n)^{1/2}$,

$$\sup_{f \in C_k(B)} P_f\{|\hat f(x) - f(x)| > a_n\} \le a_n^{-2} \sup_{f \in C_k(B)} \{\mathrm{var}\, \hat f(x) + |E\hat f(x) - f(x)|^2\}$$

$$\le C_3(B)\, a_n^{-2} \{(nh^2)^{-1} e^{1/h^2} + h^{2k}\} \to 0 .$$

This proves Theorem 2.
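As a worked check of the last step (a sketch added for clarity, using only the bandwidth $h^2 = 2/\log n$ and the condition $(\log n)^{k/2} a_n \to \infty$), the two terms in the final bound can be evaluated explicitly:

```latex
\[
e^{1/h^2} = e^{(\log n)/2} = n^{1/2}, \qquad
(nh^2)^{-1} e^{1/h^2} = \frac{\log n}{2\, n^{1/2}}, \qquad
h^{2k} = \Bigl(\frac{2}{\log n}\Bigr)^{k} .
\]
\[
a_n^{-2}\, h^{2k} = 2^{k} \bigl\{(\log n)^{k/2} a_n\bigr\}^{-2} \to 0 , \qquad
a_n^{-2}\, \frac{\log n}{2\, n^{1/2}}
  = \frac{(\log n)^{k+1}}{2\, n^{1/2}} \bigl\{(\log n)^{k/2} a_n\bigr\}^{-2} \to 0 .
\]
```

The variance term is of order $n^{-1/2} \log n$, so the bias term $h^{2k}$, of order $(\log n)^{-k}$, is what limits the rate.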

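Finally we derive Theorem 1. To simplify notation we relocate so that $x_0 = 0$, and
rescale so that Z is normal $N(0, \tfrac12)$, with density $\psi(z) = \pi^{-1/2} e^{-z^2}$. Let $\sigma > 1$, and
write $f_0$ for the $N(0, \sigma^2)$ density; $l$ for the integer part of $\log n$; $b_j = 2^{-j} \{(2j)!\}^{-1/2} j^{1/4}$;
$\eta = l^{-k/2} \varepsilon^k \delta B$, where $\varepsilon, \delta \in (0, \tfrac12]$ are fixed; and $H_j$ for Hermite polynomials
orthogonal with respect to $\psi$. The following properties are obtainable from Magnus et al.
(1966, p.252) and Sansone (1959, p.324): $H_j(-x) = (-1)^j H_j(x)$;

$$\exp\{2x\varepsilon y - (\varepsilon y)^2\} = \sum_{j=0}^{\infty} H_j(x) (\varepsilon y)^j / j! ; \qquad (2.3)$$

$$\int H_i(x) H_j(x) e^{-x^2}\, dx = \pi^{1/2} 2^j j! \ \text{ if } i = j, \ 0 \text{ otherwise} ; \qquad (2.4)$$

$$\int H_{2i}(x)\, x^{2j}\, \psi(x)\, dx = (2j)! / \{4^{j-i} (j - i)!\} ; \qquad (2.5)$$

$$|b_j H_{2j}(x) \psi(x)| \le C (1 + |x|^{5/2}) e^{-x^2/2} ; \qquad (2.6)$$

$$\eta \sup_x |(d/dx)^k\, b_l H_{2l}(x/\varepsilon) \psi(x/\varepsilon)| \le C \delta B , \qquad (2.7)$$

where C depends only on k.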

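Put $f_n(x) = f_0(x) + \eta\, b_l H_{2l}(x/\varepsilon) \psi(x/\varepsilon)$. By (2.6), and since $\eta = \eta(n) \to 0$ and $\varepsilon < 1 < \sigma$,
$f_n$ is a density for large n. If X has density $f_0$ or $f_n$ then Y = X + Z has density $g_0$ or $g_n$
respectively, where $g_0$ is the $N(0, \sigma^2 + \tfrac12)$ density, $g_n(x) = g_0(x) + \eta\, b_l h_l(x)$ and

$$h_l(x) = \int_{-\infty}^{\infty} H_{2l}(y/\varepsilon) \psi(y/\varepsilon) \psi(x - y)\, dy = \varepsilon \psi(x) \sum_{j=l}^{\infty} H_{2j}(x)\, \varepsilon^{2j} / \{4^{j-l} (j - l)!\} ,$$

using (2.3) and (2.5). Since $\psi(x)^2 / g_0(x) \le C_1 e^{-x^2}$ then

$$I = \int (g_n - g_0)^2 g_0^{-1} \le C_1 b_l^2 \eta^2 \int \{h_l(x)/\psi(x)\}^2 e^{-x^2}\, dx \qquad (2.8)$$

$$= C_2\, \varepsilon^{2k+2} \delta^2\, 2^{2l} \{(2l)!\}^{-1} l^{1/2 - k} \sum_{j=l}^{\infty} (\varepsilon^4/4)^j (2j)! \{(j - l)!\}^{-2} ,$$

using (2.4). But $\{(2l)!\}^{-1} \le C_3 (l!)^{-2} 2^{-2l} l^{1/2}$, $(2j)! \le C_3 (j!)^2 2^{2j} j^{-1/2}$ and $j!/(j - l)! =
\binom{j}{l}\, l! \le 2^j\, l!$. Hence, remembering that $\varepsilon \le \tfrac12$,

$$I \le C_4\, \varepsilon^{2k+2} \delta^2\, l^{1-k} \sum_{j=l}^{\infty} (4\varepsilon^4)^j j^{-1/2} \le C_5(\varepsilon, \delta)\, l^{1/2 - k} (4\varepsilon^4)^l = o(n^{-1}) . \qquad (2.9)$$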
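The final order claim in (2.9) can be checked directly. The following worked step is a sketch added for clarity; it assumes only that $l$ is the integer part of $\log n$ (with $l \ge 1$), that $\varepsilon \le \tfrac12$, and that $k \ge 0$.

```latex
\[
4\varepsilon^4 \le 4 \cdot 2^{-4} = \tfrac14 , \qquad
l > \log n - 1
\;\Longrightarrow\;
(4\varepsilon^4)^{l} \le 4^{-l} < 4 \cdot 4^{-\log n} = 4\, n^{-\log 4} ,
\]
\[
\log 4 \approx 1.386 > 1
\;\Longrightarrow\;
l^{1/2 - k} (4\varepsilon^4)^{l} \le 4\, (\log n)^{1/2}\, n^{-\log 4} = o(n^{-1}) .
\]
```

This is why the perturbation is built from a Hermite polynomial of degree $2l$ with $l$ of order $\log n$: the chi-squared distance $I$ between $g_n$ and $g_0$ then decays faster than $n^{-1}$.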

Given B > 0, we see from (2.6) and (2.7) that by choosing $\sigma$ large and $\varepsilon$ small, not
depending on B, we may ensure that $f_0, f_n \in C_k(B)$ for large n. For an event A, let $P_n(A)$
and $P_0(A)$ denote the probability of A under $f_n$ and $f_0$ respectively. If $\{a_n\}$ satisfies (2.1),
then by (2.9) and Cauchy-Schwarz,

$$[P_n\{|\hat f(0) - f_n(0)| \le a_n\}]^2 \le P_0\{|\hat f(0) - f_n(0)| \le a_n\} (1 + I)^n$$

$$= \{1 + o(1)\}\, P_0\{|\hat f(0) - f_n(0)| \le a_n\} ,$$

so that both $P_0\{|\hat f(0) - f_n(0)| \le a_n\}$ and $P_0\{|\hat f(0) - f_0(0)| \le a_n\}$ converge to one as
$n \to \infty$. Hence $|f_n(0) - f_0(0)| \le 2a_n$ for large n. But $|f_n(0) - f_0(0)| = \eta\, b_l (2l)! / (l!\, \pi^{1/2}) \ge
2CB (\log n)^{-k/2}$, where C does not depend on B. Therefore $a_n \ge CB (\log n)^{-k/2}$ for large
n. Since this is true for each B > 0 then $(\log n)^{k/2} a_n \to \infty$, completing the proof of
Theorem 1.
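The Cauchy-Schwarz step in the proof of Theorem 1 can be spelled out as follows. This is a sketch added for clarity; $E_0$ denotes expectation when $Y_1, \ldots, Y_n$ are independent with density $g_0$, and A is any event determined by the sample.

```latex
\[
P_n(A) = E_0\Bigl\{ 1_A \prod_{i=1}^{n} \frac{g_n(Y_i)}{g_0(Y_i)} \Bigr\}
\le P_0(A)^{1/2} \Bigl[ E_0 \prod_{i=1}^{n} \Bigl\{ \frac{g_n(Y_i)}{g_0(Y_i)} \Bigr\}^{2} \Bigr]^{1/2}
= P_0(A)^{1/2} (1 + I)^{n/2} ,
\]
\[
E_0 \Bigl\{ \frac{g_n(Y)}{g_0(Y)} \Bigr\}^{2}
= \int \frac{g_n^2}{g_0}
= 1 + 2 \int (g_n - g_0) + \int \frac{(g_n - g_0)^2}{g_0}
= 1 + I ,
\qquad
(1 + I)^n \le e^{nI} \to 1 \ \text{ since } I = o(n^{-1}) .
\]
```

Squaring the first display gives the inequality used in the proof.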

The same construction can be used to show that if k = 2, the distribution function of
X can be estimated at a rate no faster than $(\log n)^{-3/2}$. Let $F_n$ and $F_0$ be the distribution
functions for $f_n$ and $f_0$ in the proof of Theorem 1, and evaluate them at $\varepsilon x_0$, where $x_0 > 0$.
The calculations rely on an approximation to $H_{2l-1}(x_0)$ given by Magnus et al. (1966,
p.254) and various integral identities on p. 251 of the same reference. We omit the details.


3.  DECONVOLUTION  FOR  GENERAL  ERRORS 

There are versions of Theorems 1 and 2 for a variety of different types of error distributions.
The general principle is that “the smoother the residual distribution, the slower
is the optimal achievable rate of convergence”. It is convenient to consider this principle
in the Fourier domain, bearing in mind that smoother distributions have characteristic
functions with thinner tails. If X, Y and Z have respective characteristic functions $\phi_X$,
$\phi_Y$ and $\phi_Z$, and if Y = X + Z where X and Z are independent, then the characteristic
function of X is recoverable from that of Y via the formula $\phi_X = \phi_Y / \phi_Z$. Any data-based
form of this inversion becomes increasingly difficult as the tails of $\phi_Z$ become thinner.
For example, if Z has a gamma distribution with shape parameter $\alpha$, then the tails of
$\phi_Z(t)$ decrease like $|t|^{-\alpha}$ as $|t| \to \infty$, and so deconvolution is difficult for large $\alpha$. In fact,
the fastest achievable rate of convergence over densities in $C_k(B)$ is $n^{-k/(2k + 2\alpha + 1)}$. This
is made clear by the following analogue of Theorem 1. Again, $\hat f(x_0)$ is a nonparametric
estimator of $f(x_0)$.

Theorem 3. Assume that the error distribution is gamma, with shape parameter $\alpha > 0$.
If, for some sequence of positive constants $\{a_n, n \ge 1\}$, we have

$$\liminf_{n \to \infty} \inf_{f \in C_k(B)} P_f\{|\hat f(x_0) - f(x_0)| \le a_n\} = 1$$

for each B > 0, then

$$\lim_{n \to \infty} n^{k/(2k + 2\alpha + 1)} a_n = +\infty . \qquad (3.1)$$
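The Fourier-domain principle behind Theorem 3 can be illustrated numerically. The sketch below compares the factor $1/|\phi_Z(t)|$, by which the inversion $\phi_X = \phi_Y/\phi_Z$ amplifies noise at frequency t, for standard normal errors and for gamma errors. It assumes unit scale for the gamma distribution, so that $|\phi_Z(t)| = (1 + t^2)^{-\alpha/2}$; the function names are illustrative.

```python
import math


def cf_modulus_normal(t):
    """|phi_Z(t)| for Z ~ N(0, 1)."""
    return math.exp(-0.5 * t * t)


def cf_modulus_gamma(t, alpha):
    """|phi_Z(t)| for Z ~ gamma(alpha) with unit scale: |1 - it|^(-alpha)."""
    return (1.0 + t * t) ** (-alpha / 2.0)


def gamma_rate_exponent(k, alpha):
    """Exponent r in the rate n^(-r) of Theorem 3: r = k / (2k + 2 alpha + 1)."""
    return k / (2.0 * k + 2.0 * alpha + 1.0)


# Amplification 1 / |phi_Z(t)| grows exponentially for normal errors
# and only polynomially for gamma errors.
for t in (1.0, 5.0, 10.0):
    print(t, 1.0 / cf_modulus_normal(t), 1.0 / cf_modulus_gamma(t, alpha=2.0))

# The rate exponent shrinks as the gamma shape parameter grows.
for alpha in (0.5, 1.0, 2.0, 5.0):
    print(alpha, gamma_rate_exponent(k=2, alpha=alpha))
```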

The “double gamma” case, where Z is symmetric and $|Z|$ is gamma($\alpha$), is similar.
There, Theorem 3 continues to hold for integer $\alpha$, provided $2\alpha$ in (3.1) is changed to
$4(\alpha - [\alpha/2])$, where $[\alpha/2]$ denotes the largest integer not exceeding $\alpha/2$. In particular,
the optimal rate of convergence when errors have a double exponential distribution is
$n^{-k/(2k+6)}$.

Proofs of results such as Theorem 3, where “algebraic” rates are available, run as
follows. Let $\varepsilon \to 0$ as $n \to \infty$, and fix a k-times differentiable density $f_0$ which is bounded
away from zero in a neighbourhood of the origin. Let H be a bounded, compactly supported
function with at least k bounded derivatives, and satisfying $H(0) \ne 0$ and $\int x^j H(x)\, dx = 0$
for $0 \le j \le \alpha + 1$. Put $f_n(x) = f_0(x) + \varepsilon^k H(x/\varepsilon)$, and let $g_0$ and $g_n$ be the convolution
densities for $f_0$ and $f_n$ respectively. It may be shown that if $\varepsilon = n^{-1/(2k + 2\alpha + 1)}$ then I,
defined at (2.8), satisfies $I = O(n^{-1})$. Then, arguing much as in the proof of Theorem 1, the
best attainable rate of convergence emerges as being no faster than $\varepsilon^k$. Similar techniques
show that for smooth, infinitely differentiable error densities such as the Cauchy, the
optimal convergence rate is slower than $n^{-a}$ for any $a > 0$.

Stefanski-Carroll  (1987b)  type  kernel  estimators  achieve  optimal  rates  in  the  normal, 
gamma  and  “double  gamma”  cases.  For  the  sake  of  brevity  we  have  omitted  a  proof  in 
the  latter  two  cases. 

4.  DISCUSSION 

Deconvolution problems are important in their own right, as well as in nonparametric
estimation of priors. In measurement error models, deconvolution arises if one wishes
to use data to suggest models for the unobservable predictors or to estimate conditional
moments useful in likelihood calculations. When the measurement errors are normally
distributed, our results are pessimistic, suggesting that it will be difficult to deconvolve
effectively over a wide class of distributions for X.